This tutorial is still under development and may change slightly as it is being edited.

A Brief Introduction

This tutorial is relevant to the published paper RNA exploits an exposed regulatory site to inhibit the enzymatic activity of PRC2. However, the workflow can be used for data processed by MaxQuant. The data for this tutorial is available here.The relevant files are located within the .zip files for each repeat as MaxQuant_txt_folder_output/peptides.txt. For simplicity, all “peptides.txt” files have been re-named according to their experiment numbers as part of the external data available as part of the R package.

In this tutorial, we w

Setting Up Our Environment

Load Libraries

First, load the libraries needed for the package and tutorial.

library(ggplot2)
library(bio3d)
library(Biostrings)
library(seqinr)
library(RColorBrewer)
library(openxlsx)
library(viridis)
library(stringr)
library(svglite)
library(crisscrosslinker)

Defining Our Variables

When loading the data, it is important to make sure that the files for input are named well so the functions will be able to correctly identify and organize the data.

sys.path <- system.file("extdata/NSMB_RBDmap",package = 'crisscrosslinker',mustWork = TRUE)


sys.files <- list.files(sys.path)
sys.files
##  [1] "Repeat01_NoXL_ArgC_eluate.txt"             
##  [2] "Repeat01_NoXL_ArgC_input.txt"              
##  [3] "Repeat01_NoXL_LysC_eluate.txt"             
##  [4] "Repeat01_NoXL_LysC_input.txt"              
##  [5] "Repeat01_UV_ArgC_eluate.txt"               
##  [6] "Repeat01_UV_ArgC_input.txt"                
##  [7] "Repeat01_UV_LysC_eluate.txt"               
##  [8] "Repeat01_UV_LysC_input.txt"                
##  [9] "Repeat02_NoXL_ArgC_eluate_MaxQuant_txt.txt"
## [10] "Repeat02_NoXL_ArgC_input_MaxQuant_txt.txt" 
## [11] "Repeat02_NoXL_LysC_eluate_MaxQuant_txt.txt"
## [12] "Repeat02_NoXL_LysC_input_MaxQuant_txt.txt" 
## [13] "Repeat02_UV_ArgC_eluate_MaxQuant_txt.txt"  
## [14] "Repeat02_UV_ArgC_input_MaxQuant_txt.txt"   
## [15] "Repeat02_UV_LysC_eluate_MaxQuant_txt.txt"  
## [16] "Repeat02_UV_LysC_input_MaxQuant_txt.txt"   
## [17] "Repeat03_NoXL_ArgC_eluate_MaxQuant_txt.txt"
## [18] "Repeat03_NoXL_ArgC_input_MaxQuant_txt.txt" 
## [19] "Repeat03_NoXL_LysC_eluate_MaxQuant_txt.txt"
## [20] "Repeat03_NoXL_LysC_input_MaxQuant_txt.txt" 
## [21] "Repeat03_UV_ArgC_eluate_MaxQuant_txt.txt"  
## [22] "Repeat03_UV_ArgC_input_MaxQuant_txt.txt"   
## [23] "Repeat03_UV_LysC_eluate_MaxQuant_txt.txt"  
## [24] "Repeat03_UV_LysC_input_MaxQuant_txt.txt"   
## [25] "Repeat04_NoXL_ArgC_eluate_MaxQuant_txt.txt"
## [26] "Repeat04_NoXL_ArgC_input_MaxQuant_txt.txt" 
## [27] "Repeat04_NoXL_LysC_eluate_MaxQuant_txt.txt"
## [28] "Repeat04_NoXL_LysC_input_MaxQuant_txt.txt" 
## [29] "Repeat04_UV_ArgC_eluate_MaxQuant_txt.txt"  
## [30] "Repeat04_UV_ArgC_input_MaxQuant_txt.txt"   
## [31] "Repeat04_UV_LysC_eluate_MaxQuant_txt.txt"  
## [32] "Repeat04_UV_LysC_input_MaxQuant_txt.txt"

Each of the file names contains 3 pieces of information: the name of the experiment, if it is an input/eluate file, and which protease was used for the experiment.

head(read.table(paste0(sys.path,'/',sys.files)[1], sep='\t',header = TRUE))
##                         Sequence         N.term.cleavage.window
## 1              CLSYYQLDGSNTIQCIK REKDFYRSGEQVAFKCLSYYQLDGSNTIQC
## 2                DAEEWFFTKTEELNR EMRDQYEKMAEKNRKDAEEWFFTKTEELNR
## 3                    DDTKCLASIAK NFCLFQSNSKDLLFRDDTKCLASIAKKTYD
## 4                DLDAECLHRTELETK SKRTDMEFTFVQLKKDLDAECLHRTELETK
## 5              FSPNVHGLIGQFMNEPK INVDFLGIYIPPTNKFSPNVHGLIGQFMNE
## 6 GGRQGSYHEQSVDRSGHSGSHHSHTTSQGR QEGQDTIRGHPGSRRGGRQGSYHEQSVDRS
##           C.term.cleavage.window Amino.acid.before First.amino.acid
## 1 SYYQLDGSNTIQCIKSKWIGRPACRDVSCG                 K                C
## 2 DAEEWFFTKTEELNREVATNTEALQSSRTE                 K                D
## 3 LLFRDDTKCLASIAKKTYDSYLGDDYVRAM                 R                D
## 4 DLDAECLHRTELETKLKSLESFVELMKTIY                 K                D
## 5 PNVHGLIGQFMNEPKIHVFNERPGKDPEKP                 K                F
## 6 GHSGSHHSHTTSQGRSDASHGQSGSRSASR                 R                G
##   Second.amino.acid Second.last.amino.acid Last.amino.acid
## 1                 L                      I               K
## 2                 A                      N               R
## 3                 D                      A               K
## 4                 L                      T               K
## 5                 S                      P               K
## 6                 G                      G               R
##   Amino.acid.after A.Count R.Count N.Count D.Count C.Count Q.Count E.Count
## 1                S       0       0       1       1       2       2       0
## 2                E       1       1       1       1       0       0       4
## 3                K       2       0       0       2       1       0       0
## 4                L       1       1       0       2       1       0       3
## 5                I       0       0       2       0       0       1       1
## 6                S       0       3       0       1       0       3       1
##   G.Count H.Count I.Count L.Count K.Count M.Count F.Count P.Count S.Count
## 1       1       0       2       2       1       0       0       0       2
## 2       0       0       0       1       1       0       2       0       0
## 3       0       0       1       1       2       0       0       0       1
## 4       0       1       0       3       1       0       0       0       0
## 5       2       1       1       1       1       1       2       2       1
## 6       6       5       0       0       0       0       0       0       7
##   T.Count W.Count Y.Count V.Count U.Count O.Count Length Missed.cleavages
## 1       1       0       2       0       0       0     17                0
## 2       2       1       0       0       0       0     15                1
## 3       1       0       0       0       0       0     11                1
## 4       2       0       0       0       0       0     15                1
## 5       0       0       0       1       0       0     17                0
## 6       2       0       1       1       0       0     30                2
##       Mass                Proteins Leading.razor.protein Start.position
## 1 2061.950             CON__Q28085           CON__Q28085           1026
## 2 1913.880 CON__Q6IFX2;CON__P02533           CON__Q6IFX2            280
## 3 1220.607 CON__Q29443;CON__Q0IIK2           CON__Q29443            640
## 4 1828.862           CON__Q6KB66-1         CON__Q6KB66-1            191
## 5 1913.946             CON__Q9TRI1           CON__Q9TRI1            852
## 6 3215.434             CON__P20930           CON__P20930           3767
##   End.position Unique..Groups. Unique..Proteins. Charges        PEP
## 1         1042             yes               yes       4 0.00961790
## 2          294             yes                no       4 0.00023835
## 3          650             yes                no       3 0.00810090
## 4          205             yes               yes       4 0.00023835
## 5          868             yes               yes       4 0.00961790
## 6         3796             yes               yes       4 0.01511400
##     Score Intensity Reverse Potential.contaminant id Protein.group.IDs
## 1 3.60420 623060000      NA                     +  0                14
## 2 4.10180 331750000      NA                     +  1                 5
## 3 0.00000    270080      NA                     +  2                12
## 4 1.18580  10152000      NA                     +  3                20
## 5 0.90048    890910      NA                     +  4                24
## 6 1.75540 158400000      NA                     +  5                 8
##   Mod..peptide.IDs Evidence.IDs MS.MS.IDs Best.MS.MS
## 1                0            0         0          0
## 2                1            1         1          1
## 3                2            2         2          2
## 4                3            3         3          3
## 5                4            4         4          4
## 6                5            5         5          5
##   Oxidation..M..site.IDs MS.MS.Count
## 1                     NA           0
## 2                     NA           0
## 3                     NA           0
## 4                     NA           0
## 5                     NA           0
## 6                     NA           1

If comparing between multiple experiments, indicate the experiment names in the code.

#experiment_names <- c('experiment1','experiment2')
experiment_names <- c('Repeat02_NoXL','Repeat02_UV')

We will also need to load the FASTA file that was used during the initial analysis to get the tryptic peptides. For more information on how this FASTA file was generated, please refer to the paper.

fasta_file <- seqinr::read.fasta(system.file("extdata/NSMB_FASTA",'PRC2_5m.fasta',package = 'crisscrosslinker',mustWork = TRUE))

If the file format is from MaxQuant with an Andromeda score, you will need to set it. The default setting is 20 and we will explicitly set it in this demonstration.

cutoff_score <- 20

Initializing the Data

Extracting the Hits

We are now ready to get a list of the relevant hits for our data.

sequence_hit_list <- rbd.makeSeqHitList(fasta_file = fasta_file,
                                        file_format = 'txt',
                                        cutoff_score = cutoff_score,
                                        experiment_directory = sys.path)

sequence_hit_list[[2]]
## $sequence
##  [1] "ASMSEFLESEDGEVEQQR"             "ATWETILDGK"                    
##  [3] "DEVLSADYDLLGEK"                 "DVSCPIR"                       
##  [5] "EAAFDDAVEER"                    "ECDPDLCLTCGAADHWDSK"           
##  [7] "ECSVTSDLDFPTQVIPLK"             "ESLTTDLQTR"                    
##  [9] "ESSIIAPAPAEDVDTPPR"             "EVSTAPAGTDMPAAK"               
## [11] "FDYSQCDIWYMRFSMDFWQKMLALGNQVGK" "FSMDFWQK"                      
## [13] "HPSKPDPSGECNPDLR"               "IHFPDFSTR"                     
## [15] "IKPSESNVTILGR"                  "LMIWDTR"                       
## [17] "LQLLDGEYEVAMQEMEECPISK"         "NFMLHLVSMHDFNLISIMSIDK"        
## [19] "NSESLHQENKPGSVK"                "NSESLHQENKPGSVKPTQTIAVK"       
## [21] "PDPSGECNPDLR"                   "PSTPTINVLESK"                  
## [23] "SLSPGAASSSSGDGDGK"              "STPAMMNGQGSTTSSSK"             
## [25] "TEILNQEWK"                      "VINEEYK"                       
## [27] "VMMVNGDHR"                      "WLGDLILSK"                     
## [29] "YSQADALK"                      
## 
## $intensity
##  [1] 1418600000 1513200000  199910000  208170000  578880000   24039000
##  [7]   90662000 1032800000  353530000  955100000    9762200  616470000
## [13]  235430000  540690000  269540000  248550000 3419100000    9372700
## [19]   25135000  168310000    3666600  372870000   27224000   33699000
## [25]  577730000  197920000   81929000 2322600000  217840000
## 
## $protein
##  [1] "SUZ12_Q15022"  "SUZ12_Q15022"  "EED_O75530"    "SUZ12_Q15022" 
##  [5] "RBBP4_Q09028"  "EZH2_Q15910-2" "EZH2_Q15910-2" "SUZ12_Q15022" 
##  [9] "EZH2_Q15910-2" "EED_O75530"    "EED_O75530"    "EED_O75530"   
## [13] "RBBP4_Q09028"  "EED_O75530"    "EED_O75530"    "RBBP4_Q09028" 
## [17] "SUZ12_Q15022"  "SUZ12_Q15022"  "SUZ12_Q15022"  "SUZ12_Q15022" 
## [21] "RBBP4_Q09028"  "EZH2_Q15910-2" "AEBP2_Q6ZN18"  "AEBP2_Q6ZN18" 
## [25] "EZH2_Q15910-2" "RBBP4_Q09028"  "EZH2_Q15910-2" "EED_O75530"   
## [29] "EZH2_Q15910-2"
## 
## $amino_acid_before
##  [1] "K" "R" "R" "K" "K" "R" "R" "K" "K" "R" "R" "R" "K" "K" "K" "K" "R"
## [18] "R" "R" "R" "K" "R" "R" "R" "R" "R" "K" "R" "R"
## 
## $amino_acid_after
##  [1] "R" "K" "K" "R" "R" "K" "K" "R" "R" "K" "K" "K" "R" "R" "R" "R" "K"
## [18] "K" "K" "K" "R" "K" "K" "K" "K" "K" "R" "K" "K"

Here we have found all of the tryptic peptides that:

  • Have R/K or K/R cleavage
  • Correspond to a protein name in the FASTA file
  • Have an Andromeda score > our cutoff

Making an Input/Eluate Table

Since we want to compare the input and eluate data from the experiments, let’s make a table comparing the two.

input_eluate_table <- rbd.makeIETable(sequence_hit_list, experiment_names)

We can see how the results look here:

head(input_eluate_table)
##                                                   sequence      input
## Repeat02_NoXL_ArgC.ASMSEFLESEDGEVEQQR   ASMSEFLESEDGEVEQQR  288150000
## Repeat02_NoXL_ArgC.ATWETILDGK                   ATWETILDGK 2490900000
## Repeat02_NoXL_ArgC.DEVLSADYDLLGEK           DEVLSADYDLLGEK  309200000
## Repeat02_NoXL_ArgC.DPNLLLSVSK                   DPNLLLSVSK   49414000
## Repeat02_NoXL_ArgC.EAAFDDAVEER                 EAAFDDAVEER 1168100000
## Repeat02_NoXL_ArgC.ECDPDLCLTCGAADHWDSK ECDPDLCLTCGAADHWDSK  149440000
##                                        eluate           category
## Repeat02_NoXL_ArgC.ASMSEFLESEDGEVEQQR       0 Repeat02_NoXL_ArgC
## Repeat02_NoXL_ArgC.ATWETILDGK               0 Repeat02_NoXL_ArgC
## Repeat02_NoXL_ArgC.DEVLSADYDLLGEK           0 Repeat02_NoXL_ArgC
## Repeat02_NoXL_ArgC.DPNLLLSVSK               0 Repeat02_NoXL_ArgC
## Repeat02_NoXL_ArgC.EAAFDDAVEER              0 Repeat02_NoXL_ArgC
## Repeat02_NoXL_ArgC.ECDPDLCLTCGAADHWDSK      0 Repeat02_NoXL_ArgC
##                                              protein aa_before aa_after
## Repeat02_NoXL_ArgC.ASMSEFLESEDGEVEQQR   SUZ12_Q15022         K        R
## Repeat02_NoXL_ArgC.ATWETILDGK           SUZ12_Q15022         R        K
## Repeat02_NoXL_ArgC.DEVLSADYDLLGEK         EED_O75530         R        K
## Repeat02_NoXL_ArgC.DPNLLLSVSK             EED_O75530         R        K
## Repeat02_NoXL_ArgC.EAAFDDAVEER          RBBP4_Q09028         K        R
## Repeat02_NoXL_ArgC.ECDPDLCLTCGAADHWDSK EZH2_Q15910-2         R        K

We can now move on to making a plot.

Making an Input/Eluate Plot

As you can see, many of the hits that were found in the input were not found in the eluate. We will filter those later. For now, let’s plot everything.

input_eluate_graph <- rbd.makeIEPlot(input_eluate_table, experiment_names)

print(input_eluate_graph)

We can change the default settings of the plot, including colors and the title by inputting some more variables:

name_of_experiment <- 'RBDmap Experiments: Input vs. Eluate'
input_eluate_graph <- rbd.makeIEPlot(input_eluate_table,
                                     experiment_names,
                                     palette = 'Pastel1',
                                     fill = 'white',
                                     colour = 'black',
                                     size = 1,
                                     title = name_of_experiment)

print(input_eluate_graph)

Finding the Binding Sites

Now that we’ve compared our control, let’s remove all results that do not show up in the eluate files.

filtered_input_eluate_table <- input_eluate_table[input_eluate_table$eluate > 0,]

head(filtered_input_eluate_table)
##                                                       sequence      input
## Repeat02_UV_ArgC.ECSVTSDLDFPTQVIPLK         ECSVTSDLDFPTQVIPLK 1101400000
## Repeat02_UV_ArgC.WLGDLILSK                           WLGDLILSK 1180000000
## Repeat02_UV_LysC.ECSVTSDLDFPTQVIPLK         ECSVTSDLDFPTQVIPLK 1649500000
## Repeat02_UV_LysC.LQLLDGEYEVAMQEMEECPISK LQLLDGEYEVAMQEMEECPISK 1057200000
## Repeat02_UV_LysC.WLGDLILSK                           WLGDLILSK 1288100000
##                                           eluate         category
## Repeat02_UV_ArgC.ECSVTSDLDFPTQVIPLK     13380000 Repeat02_UV_ArgC
## Repeat02_UV_ArgC.WLGDLILSK              14292000 Repeat02_UV_ArgC
## Repeat02_UV_LysC.ECSVTSDLDFPTQVIPLK     16849000 Repeat02_UV_LysC
## Repeat02_UV_LysC.LQLLDGEYEVAMQEMEECPISK  8763700 Repeat02_UV_LysC
## Repeat02_UV_LysC.WLGDLILSK              22946000 Repeat02_UV_LysC
##                                               protein aa_before aa_after
## Repeat02_UV_ArgC.ECSVTSDLDFPTQVIPLK     EZH2_Q15910-2         R        K
## Repeat02_UV_ArgC.WLGDLILSK                 EED_O75530         R        K
## Repeat02_UV_LysC.ECSVTSDLDFPTQVIPLK     EZH2_Q15910-2         R        K
## Repeat02_UV_LysC.LQLLDGEYEVAMQEMEECPISK  SUZ12_Q15022         R        K
## Repeat02_UV_LysC.WLGDLILSK                 EED_O75530         R        K

We’ll use these results for finding binding sequences. We’ll first get a data.frame of the binding sequences aligned to the FASTA file.

bs_output_fasta <- rbd.getBSfromIET(filtered_input_eluate_table,
                                    fasta_file = fasta_file,
                                    align_to = 'FASTA')

head(bs_output_fasta[,1:6])
##   binding_site_start binding_site_end
## 1                104              165
## 2                327              359
## 3                 66               85
## 4                310              312
## 5                298              317
##                                                 binding_sequence
## 1 TLNAVASVPIMYSWSPLQQNFMVEDETVLHNIPYMGDEVLDQDGTFIEELIKNYDGKVHGDR
## 2                              SCENAIVCWKPGKMEDDIDKIKPSESNVTILGR
## 3                                           QRRIQPVHILTSVSSLRGTR
## 4                                                            NRR
## 5                                           IHFPDFSTRDIHRNYVDCVR
##          ms2_peptide_seq ms2_start ms2_end
## 1     ECSVTSDLDFPTQVIPLK        86     103
## 2              WLGDLILSK       318     326
## 3     ECSVTSDLDFPTQVIPLK        86     103
## 4 LQLLDGEYEVAMQEMEECPISK       313     334
## 5              WLGDLILSK       318     326

Making a Heatmap

Let’s make a heatmap and sort the sequences by their names.

bs_freqVector <- rbd.freqVector(bs_output = bs_output_fasta,
                                name_by = 'pro_name',
                                heatmap = TRUE,
                                db_selection = 'FASTA')
## Saving 7 x 5 in image

We can also adjust the colors of the heatmap if we so wish. You can put in a standard RColorBrewer or viridis palette or a list of hexcodes and/or standard R Colors.

bs_freqVector <- rbd.freqVector(bs_output = bs_output_fasta,
                                name_by = 'pro_name',
                                heatmap = TRUE,
                                db_selection = 'FASTA',
                                colors = "Blues",
                                save_plot = FALSE)
## Warning in brewer.pal(num, colors): minimal value for n is 3, returning requested palette with 3 different levels

#do a grid of different color options?
bs_freqVector <- rbd.freqVector(bs_output = bs_output_fasta,
                                name_by = 'pro_name',
                                heatmap = TRUE,
                                db_selection = 'FASTA',
                                colors = c('#b3ecec','#43e8d8','#48d1cc'),
                                save_plot = FALSE)

bs_freqVector <- rbd.freqVector(bs_output = bs_output_fasta,
                                name_by = 'pro_name',
                                heatmap = TRUE,
                                db_selection = 'FASTA',
                                colors = c("#465ca8","#7767b8","#a272c4",
                                           "#cc7ecc",'#f38bd1'), #palette from color-hex.com
                                save_plot = FALSE)

If you require further customization, we recommend using the freqVector directly within a heatmap and customizing from there.

Aligning Peptides to a PDB File

We will now align the binding sequences to a PDB file. For simplicity, we’ll only focus on one protein: EZH2.

ezh2_iet <- filtered_input_eluate_table[filtered_input_eluate_table$protein == 'EZH2_Q15910-2',]

We already know the PDB file and chain that we want to align the proteins to. In this case, we can create and add a new table:

protein_dict <- read.csv(system.file("extdata/NSMB_misc",'PRC2_protein_dict.csv',package = 'crisscrosslinker',mustWork = TRUE))
head(protein_dict)
##      protein_id uniprot_id pdb_id
## 1 EZH2_Q15910-2   Q15910-2 6C23_C
## 2    EED_O75530     O75530 6C23_L
## 3  RBBP4_Q09028     Q09028 6C23_N
## 4  SUZ12_Q15022     Q15022 6C23_A
## 5  AEBP2_Q6ZN18   Q6ZN18-2 6C23_P

Now we don’t have to go through menus to be able to define the PDB ID/chain that we need to use. This will save us time when making our table of binding sequences.

You can always leave the dictionary blank in order to access an interactive menu to be able to use BLAST in order to align your sequence to a known PDB structure.

bs_ezh2 <- rbd.getBSfromIET(ezh2_iet,
                            fasta_file = fasta_file,
                            align_to = 'PDB',
                            protein_dict = protein_dict)
##   HEADER    GENE REGULATION                         05-JAN-18   6C23               
##   HEADER    GENE REGULATION                         05-JAN-18   6C23

Let’s see how many matches we had to the chain we chose:

bs_ezh2$db
## [1] "PDB" "PDB"

2 matched our chain! Let’s visualize it.

Making the PyMOL File

Now that we have the binding sequences aligned to a PDB file, we can create a PyMOL file.

rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = NULL)
## No color chosen --> reverting to PyMol Tints palette

## $rect
## $rect$w
## [1] 1.105
## 
## $rect$h
## [1] 0.2187342
## 
## $rect$left
## [1] 0.568
## 
## $rect$top
## [1] 1.432
## 
## 
## $text
## $text$x
## [1] 0.631 0.631
## 
## $text$y
## [1] 1.322633 1.267949
## 
## 
## No var with value 0 detected --> coloring gray

There is an additional legend with the colors corresponding to the binding sequences. The colors will be assigned based on the order the binding sequences are in. If you would like the sequences in a different order, please order them before executing rbd.pymol().

If we do not choose any colors, it will default to the PyMOL tints palette. However, we can input any RColorBrewer or viridis palette as well as standard PyMOL/R colors or hexcodes.

par(mfrow=c(2,2))
rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = 'Dark2', file.name = 'pymol_Dark2.pml') #RColorBrewer palette
## Warning in brewer.pal(num, colors): minimal value for n is 3, returning requested palette with 3 different levels
## $rect
## $rect$w
## [1] 2.138085
## 
## $rect$h
## [1] 0.5897368
## 
## $rect$left
## [1] 0.568
## 
## $rect$top
## [1] 1.432
## 
## 
## $text
## $text$x
## [1] 0.6899 0.6899
## 
## $text$y
## [1] 1.1371316 0.9896974
## Warning in Ops.factor(vars, 1): '>' not meaningful for factors
## No var with value 0 detected --> coloring gray
rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = 'magma', file.name = 'pymol_magma.pml') #viridis palette
## $rect
## $rect$w
## [1] 2.138085
## 
## $rect$h
## [1] 0.5897368
## 
## $rect$left
## [1] 0.568
## 
## $rect$top
## [1] 1.432
## 
## 
## $text
## $text$x
## [1] 0.6899 0.6899
## 
## $text$y
## [1] 1.1371316 0.9896974
## Warning in Ops.factor(vars, 1): '>' not meaningful for factors
## No var with value 0 detected --> coloring gray
rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = c('red','blue'), file.name = 'pymol_rgb.pml') #standard colors
## Warning in if ((typeof(colors) == "character") & (startsWith(colors, "#")))
## {: the condition has length > 1 and only the first element will be used
## $rect
## $rect$w
## [1] 2.138085
## 
## $rect$h
## [1] 0.5897368
## 
## $rect$left
## [1] 0.568
## 
## $rect$top
## [1] 1.432
## 
## 
## $text
## $text$x
## [1] 0.6899 0.6899
## 
## $text$y
## [1] 1.1371316 0.9896974
## 
## 
## No var with value 0 detected --> coloring gray
rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = c("#f8a1be","#ffb6b1"), file.name = 'pymol_hexcodes.pml') #hexcodes
## Warning in if ((typeof(colors) == "character") & (startsWith(colors, "#")))
## {: the condition has length > 1 and only the first element will be used
## $rect
## $rect$w
## [1] 2.138085
## 
## $rect$h
## [1] 0.5897368
## 
## $rect$left
## [1] 0.568
## 
## $rect$top
## [1] 1.432
## 
## 
## $text
## $text$x
## [1] 0.6899 0.6899
## 
## $text$y
## [1] 1.1371316 0.9896974
## Warning in Ops.factor(vars, 1): '>' not meaningful for factors

## No var with value 0 detected --> coloring gray

Differential Analysis

We have now looked at one experiment, but now we want to look at multiple experiments we have run and compare between them. First we will do what we have done before, but combine the different runs together to examine how often these binding sites appear.

Each of these experiments have slighly different prefixes, so we will adjust them accordingly with a list. In this instance, we could probably get away with just using “UV” for all 3 repeats. However, for the purposes of this tutorial, we will explicitly define the prefixes.

We’re now ready to proceed to get the binding sequences for each of these repeats.

bs_output_diff_analysis <- data.frame()
num_repeats <- 1:4
repeat_names <- paste0('Repeat0',num_repeats)

for(rname in repeat_names){
  shl <- sequence_hit_list[startsWith(names(sequence_hit_list),rname)]
  enames <- paste0(rname,'_',c('UV','NoXL'))
  input_eluate_table <- rbd.makeIETable(shl, enames)
  filtered_input_eluate_table <- input_eluate_table[input_eluate_table$eluate > 0,]
  
  bs_output_pdb <- rbd.getBSfromIET(filtered_input_eluate_table,
                                  fasta_file = fasta_file,
                                  align_to = 'PDB',
                                  protein_dict = protein_dict)
  
  bs_output_diff_analysis <- rbind(bs_output_diff_analysis,data.frame(bs_output_pdb))
}

Now that we have loaded the data into our R environment, let’s make a PyMOL file:

rbd.pymol(bs_output_diff_analysis, color_by = 'freq', colors = 'magma', file.name = 'pymol_magma_diff_analysis.pml')
## Saving 7 x 5 in image

##   HEADER    GENE REGULATION                         05-JAN-18   6C23               
##   HEADER    GENE REGULATION                         05-JAN-18   6C23               
##   HEADER    GENE REGULATION                         05-JAN-18   6C23               
##   HEADER    GENE REGULATION                         05-JAN-18   6C23

It will also create a heatmap with the same color scheme as the PyMOL file if heatmap is kept as TRUE. A legend will also be generated to go with the PyMOL file, though it will be similar to the one generated by the heatmap.

Since slightly different but overlapping peptides may show up due to the use of different proteases, the frequency is measured by the amino acid rather than the full peptide.

References

  1. Castello, A. et al. Comprehensive Identification of RNA-Binding Domains in Human Cells. Molecular Cell 63, 696-710, doi:10.1016/j.molcel.2016.06.029 (2016).

  2. Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nature Protocols 11, 2301-2319, doi:10.1038/nprot.2016.136 (2016).